0.0.1 Preparation and assumed knowledge

  • Have a look at the Lab and make sure you have all packages installed on your computer.
  • For the exercises in this lab, we encourage you to work in groups, as it will expose you to different views, ideas, and opinions. That being said, please make an effort to complete the lab by typing your own code and (most importantly) expressing your own conclusions and interpretations.

0.0.2 Aims

  • Extract and perform basic data pre-processing on order book snapshots. Students are expected to be able to read all datasets into R and perform basic data wrangling to extract relevant information.
  • Use order book snapshots to compute statistics like the BidAskSpread, log return, and realised volatility in a specific time interval.
  • Perform initial data analysis with visualization tools.
  • Work with basic time-series packages in R.
  • Assess a time-series predictor.
  • Interpret the prediction results both quantitatively and qualitatively.
  • Submit 3.2 for formative feedback.


1 Realised volatility

Volatility is one of the most prominent terms you’ll hear on any trading floor, and for good reason. In financial markets, volatility captures the amount of fluctuation in prices. High volatility is associated with periods of market turbulence and large price swings, while low volatility describes calmer, quieter markets. For trading firms like Optiver, accurately predicting volatility is essential for the trading of options, whose price is directly related to the volatility of the underlying product. In this exercise, we will explore the order book dataset provided by Optiver to predict future realised volatility. An order book lists the number of shares being bid on or offered at each price point. For simplicity, we only provide the highest two bid prices and lowest two ask prices in the order book (which does not affect the computation of the realised volatility). Based on the order book, we need to be able to compute statistics like the weighted average price (WAP) and log returns. We will then calculate the realised volatility during any fixed time interval based on the log returns. Finally, we will use historical data to estimate future volatility.
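Before loading the real data, these quantities can be sketched on a toy two-snapshot example. All numbers below are made up for illustration; the formulas match the ones used later in the lab.

```r
# Toy one-level snapshots at two consecutive seconds (illustrative numbers only)
bid_price <- c(1.0010, 1.0012); bid_size <- c(100, 80)
ask_price <- c(1.0020, 1.0021); ask_size <- c(50, 60)

# WAP weights each side's price by the size on the OPPOSITE side,
# so the price leans towards the side with less depth
WAP <- (bid_price * ask_size + ask_price * bid_size) / (bid_size + ask_size)

# Log return between the two consecutive snapshots
log_r <- log(WAP[2] / WAP[1])

# Realised volatility over an interval: square root of the sum of squared log returns
RV <- sqrt(sum(log_r ^ 2))
```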

library(ggplot2)
library(dplyr)
library(rugarch)

1.1 Import data

The data we look at today are standard csv files provided by Optiver. Each csv file contains order book snapshots of a specific stock at specific time points. With one-second resolution, it provides a uniquely fine-grained look at the microstructure of modern financial markets. We will load the data for one stock (“stock_1.csv”) in this lab; the full dataset of 127 stocks is available to students for analysis if they select this project.

Each stock has consecutive order book snapshots (with one-second resolution) for multiple 10-min time intervals. Each row of the csv file is an order book snapshot of a stock at a specific time point. The column time_id indicates the ID code for the 10-min time bucket. Time IDs are not necessarily sequential but are consistent across all stocks. Each row of the csv file also provides the highest (lowest) two bid (ask) prices and their corresponding sizes at that time point.

Suggestions

stock1 <- read.csv("data/individual_book_train/stock_1.csv")
dim(stock1)
## [1] 1507532      11
head(stock1)
##   time_id seconds_in_bucket bid_price1 ask_price1 bid_price2 ask_price2
## 1       5                 0   1.000754   1.001542   1.000689   1.001607
## 2       5                 1   1.000754   1.001673   1.000689   1.001739
## 3       5                 2   1.000754   1.001411   1.000623   1.001476
## 4       5                 3   1.000754   1.001542   1.000689   1.001607
## 5       5                 4   1.000754   1.001476   1.000623   1.001542
## 6       5                 5   1.000754   1.001542   1.000623   1.001673
##   bid_size1 ask_size1 bid_size2 ask_size2 stock_id
## 1         1        25        25       100        1
## 2        26        60        25       100        1
## 3         1        25        25       125        1
## 4       125        25       126        36        1
## 5       100       100        25        25        1
## 6       100        25        25        60        1

1.2 Compute statistics from book data

Compute the BidAskSpread and WAP for all the snapshots of stock 1. Add the results as two new columns of stock1. Compute the log return of stock 1 when time_id == 5 and seconds_in_bucket == 16.

Suggestions

stock1 <- stock1 %>% mutate(
  WAP = (bid_price1 * ask_size1 + ask_price1 * bid_size1) / (bid_size1 + ask_size1))
stock1 <- stock1 %>% mutate(BidAskSpread = ask_price1 / bid_price1 - 1)
head(stock1)
##   time_id seconds_in_bucket bid_price1 ask_price1 bid_price2 ask_price2
## 1       5                 0   1.000754   1.001542   1.000689   1.001607
## 2       5                 1   1.000754   1.001673   1.000689   1.001739
## 3       5                 2   1.000754   1.001411   1.000623   1.001476
## 4       5                 3   1.000754   1.001542   1.000689   1.001607
## 5       5                 4   1.000754   1.001476   1.000623   1.001542
## 6       5                 5   1.000754   1.001542   1.000623   1.001673
##   bid_size1 ask_size1 bid_size2 ask_size2 stock_id      WAP BidAskSpread
## 1         1        25        25       100        1 1.000785 0.0007868064
## 2        26        60        25       100        1 1.001032 0.0009179074
## 3         1        25        25       125        1 1.000780 0.0006556053
## 4       125        25       126        36        1 1.001411 0.0007868064
## 5       100       100        25        25        1 1.001115 0.0007212558
## 6       100        25        25        60        1 1.001384 0.0007868064
# To compute the log return, we should know the WAP at two consecutive times
WAP1 <- stock1 %>% filter(time_id == 5 & seconds_in_bucket == 16) %>% pull(WAP)
WAP2 <- stock1 %>% filter(time_id == 5 & seconds_in_bucket == 15) %>% pull(WAP)
log_r <- log(WAP1 / WAP2)
print(log_r)
## [1] -0.0003073623

1.3 Compute volatility and visualisation

We will first compute the log returns at each time point. Based on these log returns, we can then calculate the realised volatility over any specific time interval. In this lab, we use non-overlapping 30-second intervals as the fixed windows for computing realised volatility.
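Concretely, if an interval contains log returns \(r_1, \dots, r_n\), the realised volatility used throughout this lab is

\[
\text{RV} = \sqrt{\sum_{i=1}^{n} r_i^2}.
\]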

[a] We consider the first 500 time_id of stock 1 for illustration. Describe what the following lines of code do. (This chunk may take up to 1 min to run.)

log_r1 <- list()
time_IDs <- unique(stock1[, 1])[1:500]
for (i in 1 : length(time_IDs)) {
  sec <- stock1 %>% filter(time_id == time_IDs[i]) %>% pull(seconds_in_bucket)
  price <- stock1 %>% filter(time_id == time_IDs[i]) %>% pull(WAP)
  log_r <- log(price[-1] / price[1:(length(price) - 1)])
  log_r1[[i]] <- data.frame(time = sec[-1], log_return = log_r)
  time.no.change <- (1:600)[!(1:600 %in% log_r1[[i]]$time)]
  if (length(time.no.change) > 0) {
    new.df <- data.frame(time = time.no.change, log_return = 0)
    log_r1[[i]] <- rbind(log_r1[[i]], new.df)
    log_r1[[i]] <- log_r1[[i]][order(log_r1[[i]]$time), ]
  }
}

Solutions Lines 1 and 2: initialise a list to store the log returns and obtain the first 500 time_id for stock 1. Line 3: start a for loop over the time IDs. Lines 4 - 7: extract the WAPs, compute the log returns, and create a data frame storing the log returns and the corresponding time points. Lines 8 - 13: find the time points at which the WAP does not change, so the log returns are 0 at these time points. We add these zero log returns to the data frame to facilitate later analysis.

[b] Divide the 10-min horizon into non-overlapping 30-second intervals. Similar to the above code, use a list to store the volatility every 30 seconds of stock 1 for different time buckets.

Suggestions

vol <- list()
comp_vol <- function(x) {
  return(sqrt(sum(x ^ 2)))
}
for (i in 1 : length(log_r1)) {
  log_r1[[i]] <- log_r1[[i]] %>% mutate(time_bucket = ceiling(time / 30))
  vol[[i]] <- aggregate(log_return ~ time_bucket, data = log_r1[[i]], FUN = comp_vol)
  colnames(vol[[i]]) <- c('time_bucket', 'volatility')
}

[c] Visualise the 10-min time series of log returns and realised volatility for stock 1 with time_id == 5.

Suggestions

ggplot(data = log_r1[[1]], aes(x = time, y = log_return)) + geom_line() 

ggplot(data = vol[[1]], aes(x = time_bucket, y = volatility)) + geom_line() + geom_point() 

2 Estimation and prediction

In this lab, we will examine three potential methods to predict future volatility: regression analysis, the HAV-RV model, and the ARMA-GARCH model. The three methods use different strategies to fit the data. The regression analysis does not treat the realised volatility as a time series but weights observations by recency. The HAV-RV model works directly on the time series of realised volatility. The ARMA-GARCH model works on the time series of log returns and predicts the volatility indirectly.

2.1 Autocorrelation

Before fitting the data, we conduct a preliminary analysis of the autocorrelation function (ACF) of the time series of log returns and realised volatility. Autocorrelation is the correlation of a time series with a lagged version of itself. The ACF starts at lag 0, which is the correlation of the time series with itself and therefore equals exactly 1.
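To make the definition concrete, here is a minimal sketch that computes the autocorrelation at lag \(k\) by hand on an artificial series and checks it against R's built-in acf(). The series and the helper function are made up for illustration.

```r
# Minimal sketch: manual autocorrelation at lag k, checked against acf()
set.seed(1)
x <- cumsum(rnorm(200))  # an artificial, positively autocorrelated series

acf_manual <- function(x, k) {
  n <- length(x)
  xbar <- mean(x)
  # acf() uses the biased estimator: the lag-k autocovariance divided by
  # the lag-0 autocovariance, both with an n (not n - k) denominator
  sum((x[1:(n - k)] - xbar) * (x[(k + 1):n] - xbar)) / sum((x - xbar) ^ 2)
}

acf_manual(x, 0)              # lag 0 is always 1 by construction
a <- acf(x, plot = FALSE)
# a$acf[k + 1] matches acf_manual(x, k)
```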

[a] Plot the ACF of the time series on log returns and realised volatility, respectively, for stock 1 with time_id == 5.

Suggestions

acf(log_r1[[1]]$log_return, main = "ACF plot for log returns")

acf(vol[[1]]$volatility, main = "ACF plot for realised volatility")

2.2 Regression analysis

We now regress the volatility in period \(t\) on the mean prices in period \(t-1\), mean number of orders in period \(t-1\), and the mean BidAskSpread in period \(t-1\). We still use stock 1 with time_id == 5 for illustration.

[a] We use the data during the first 8 minutes of each 10-min time bucket as the training data and use the remaining 2-min data for validation. Create two lists vol.train and vol.val that separately store the training and validation data in vol. The \(i\)th element of vol.train (vol.val) corresponds to the training (validation) data for vol[[i]].

Suggestions

vol.train <- list()
vol.val <- list()

for (i in 1 : length(log_r1)) {
  vol.train[[i]] <- vol[[i]][1:16, ]
  vol.val[[i]] <- vol[[i]][-(1:16), ]
}

[b] Recall that we study the first 500 time_id for stock 1, and each time_id corresponds to a specific 10-min time bucket. Create a list with length 500, where each element of the list is a data frame for a different time_id. Each data frame has four columns: the realised volatility in the training data set and the corresponding mean prices, mean number of orders, and mean BidAskSpread in the previous period. (We don’t need the first realised volatility in any 10-min time bucket because there are no covariates for this period.)

Suggestions

list.reg <- list() # list for regression
stock1 <- stock1 %>% mutate(time_bucket = ceiling(seconds_in_bucket / 30),
                            num_order = bid_size1 + ask_size1 + bid_size2 + ask_size2)
len.train <- length(vol.train[[1]]$volatility)

for (i in 1 : length(vol)) {
  stats.bucket <- stock1 %>% 
    filter(time_id == time_IDs[i] & time_bucket != 0) %>% 
    select(c(BidAskSpread, WAP, num_order, time_bucket)) 
  # for each 30-sec time bucket, we compute the following statistics
  mean.price <- aggregate(WAP ~ time_bucket, data = stats.bucket, FUN = mean)
  mean.order <- aggregate(num_order ~ time_bucket, data = stats.bucket, FUN = mean)
  mean.BAS <- aggregate(BidAskSpread ~ time_bucket, data = stats.bucket, FUN = mean)
  list.reg[[i]] <- data.frame(volatility = vol.train[[i]]$volatility[-1], 
                              price = mean.price$WAP[1:(len.train - 1)],
                              order = mean.order$num_order[1:(len.train - 1)],
                              BidAskSpread = mean.BAS$BidAskSpread[1:(len.train - 1)])
}

[c] Use weighted least squares to fit the model when time_id == 5. Assume we are now at the 8-min mark of a 10-min bucket, and we weight the observations by recency with weight \(0.8^{8-t}\) (time unit: min). Use a list with length 500 to store all the models, where each element of the list is a linear regression model for one of the 500 time_id.

Suggestions

lm.models <- list()

for (i in 1 : length(vol)) {
  lm.models[[i]] <- lm(volatility ~ price + order + BidAskSpread, list.reg[[i]],
                       weights = 0.8 ^ (((len.train - 2):0) / 2))
}
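The weight vector in the call above can be inspected directly. A sketch, assuming len.train = 16 (sixteen 30-second training intervals, so 15 observations once the first interval is dropped): each 30-second step is half a minute, which is why the exponent is divided by 2, and the exponents run from 7 minutes (oldest observation) down to 0 (most recent).

```r
# Sketch of the recency weights, assuming len.train = 16 as in this lab
len.train <- 16
w <- 0.8 ^ (((len.train - 2):0) / 2)  # exponents 7, 6.5, ..., 0.5, 0 (minutes)

length(w)     # 15 weights, one per observation in list.reg[[i]]
w[length(w)]  # most recent interval gets weight 0.8^0 = 1
w[1]          # oldest interval gets weight 0.8^7
```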

# for some periods, linear regression performs well
summary(lm.models[[162]])
## 
## Call:
## lm(formula = volatility ~ price + order + BidAskSpread, data = list.reg[[i]], 
##     weights = 0.8^(((len.train - 2):0)/2))
## 
## Weighted Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.608e-04 -5.448e-05  3.796e-05  6.759e-05  1.985e-04 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -4.557e-01  1.274e-01  -3.578  0.00433 **
## price         4.559e-01  1.271e-01   3.588  0.00426 **
## order        -1.334e-07  6.337e-07  -0.211  0.83711   
## BidAskSpread -1.277e-01  4.033e-01  -0.317  0.75752   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0001356 on 11 degrees of freedom
## Multiple R-squared:  0.7411, Adjusted R-squared:  0.6704 
## F-statistic: 10.49 on 3 and 11 DF,  p-value: 0.001477

2.3 HAV-RV model

The HAV-RV model is popular in predicting long-term volatility and may not perform well in short-term prediction. Nevertheless, we include it as a benchmark in the lab. Since the realised volatility time series only has 20 time points, we only use the volatility at \(t-1\) and mean volatility from \(t-5\) to \(t-1\) for fitting the volatility at time \(t\).
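In this reduced form, a sketch of the specification being estimated is

\[
\text{RV}_t = \beta_0 + \beta_1\, \text{RV}_{t-1} + \beta_2\, \frac{1}{5} \sum_{j=1}^{5} \text{RV}_{t-j} + \varepsilon_t.
\]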

[a] Similar to Question (b) in Section 2.2, create a list of length 500, where each element of the list is a data frame for a different time_id. Each data frame has three columns: the realised volatility, the realised volatility in the previous period, and the mean realised volatility over the previous five periods. (The first row starts from the 6th time point of each 10-min time bucket because we need the mean realised volatility over the past 5 periods.)

Suggestions

list.HAV <- list()

for (i in 1 : length(vol)) {
  mean.vol <- rep(0, len.train - 5)
  for (j in 1 : 5) {
    mean.vol <- mean.vol + vol.train[[i]]$volatility[j : (j + len.train - 6)] / 5
  }
  list.HAV[[i]] <- data.frame(vol = vol.train[[i]]$volatility[-(1:5)], 
                              vol_1 = vol.train[[i]]$volatility[5:(len.train - 1)],
                              mean_vol_5 = mean.vol)
}

[b] Fit the HAV model by using ordinary least squares (OLS) and weighted least squares (WLS). For WLS, we use \(w_t=\text{RV}_{t-1}/\sqrt{\text{RQ}_{t-1}}\) as the weight for each time period \(t\). Similar to Question (c) in Section 2.2, use two lists named HAV.ols.models and HAV.wls.models (each with length 500) to store all the models for every 10-min time bucket.
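Here \(\text{RQ}_{t-1}\) denotes the realised quarticity of period \(t-1\); for a period containing \(n\) log returns \(r_1, \dots, r_n\), it is estimated as

\[
\text{RQ} = \frac{n}{3} \sum_{i=1}^{n} r_i^4.
\]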

Suggestions

quar <- list()
comp_quar <- function(x) {
  return(length(x) / 3 * sum(x ^ 4))
}
for (i in 1 : length(log_r1)) {
  quar[[i]] <- aggregate(log_return ~ time_bucket, data = log_r1[[i]], FUN = comp_quar)
  colnames(quar[[i]]) <- c('time_bucket', 'quarticity')
}

HAV.ols.models <- list()
HAV.wls.models <- list()

for (i in 1 : length(vol)) {
  HAV.ols.models[[i]] <- lm(vol ~ vol_1 + mean_vol_5, list.HAV[[i]])
  
  HAV.wls.models[[i]] <- lm(vol ~ vol_1 + mean_vol_5, list.HAV[[i]],
                            weights = list.HAV[[i]]$vol_1 / 
                              sqrt(quar[[i]]$quarticity[5:(len.train - 1)]))
}

# HAV-RV performs well for some time buckets
summary(HAV.wls.models[[218]])
## 
## Call:
## lm(formula = vol ~ vol_1 + mean_vol_5, data = list.HAV[[i]], 
##     weights = list.HAV[[i]]$vol_1/sqrt(quar[[i]]$quarticity[5:(len.train - 
##         1)]))
## 
## Weighted Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0056587 -0.0015535  0.0000333  0.0015861  0.0050228 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.002436   0.000420   5.801 0.000405 ***
## vol_1       -0.188253   0.231292  -0.814 0.439242    
## mean_vol_5  -4.691461   1.180870  -3.973 0.004102 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003545 on 8 degrees of freedom
## Multiple R-squared:  0.856,  Adjusted R-squared:   0.82 
## F-statistic: 23.77 on 2 and 8 DF,  p-value: 0.0004302

2.4 ARMA-GARCH model

The ARMA-GARCH model works directly on the time series of log returns.

[a] Use rugarch to fit the log return time series from 0 to 480 seconds (the first 80% of the data) for each time_id. Specifically, we use \(\text{ARMA}(1, 1)\) and \(\text{GARCH}(1, 1)\) for the conditional mean and variance, respectively, and adopt a normal distribution for the noise. Like Questions (c) in Section 2.2 and (b) in Section 2.3, use a list with length 500 to store all 500 models (corresponding to the 500 time_id).

Hint: rugarch has a function ugarchfit to fit an ARMA-GARCH model. There are many algorithms for fitting the data, and an example is to use the hybrid solver for this function.
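For reference, the specification combines an ARMA(1, 1) conditional mean with a GARCH(1, 1) conditional variance:

\[
r_t = \mu + \phi r_{t-1} + \theta \varepsilon_{t-1} + \varepsilon_t, \qquad \varepsilon_t = \sigma_t z_t, \quad z_t \sim \mathcal{N}(0, 1),
\]
\[
\sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2.
\]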

Suggestions

spec <- ugarchspec(variance.model = list(model = "sGARCH", garchOrder = c(1, 1)), 
                   mean.model = list(armaOrder = c(1, 1)), 
                   distribution.model = "norm")
ARMA_GARCH.models <- list()
  
for (i in 1 : length(vol)) {
  ARMA_GARCH.models[[i]] <- ugarchfit(spec = spec, data = log_r1[[i]] %>% 
                                        filter(time <= 480) %>% pull(log_return),
                                      solver = 'hybrid')
}

[b] For each of the 500 time_id of stock 1, simulate 1000 paths of log returns in the next 30 seconds and compute the corresponding realised volatility. Average them to get an estimate of the realised volatility in the next period.

Hint: Type ? ugarchpath to see how to simulate a path based on a fitted ARMA-GARCH model.

Suggestions

RV.pred <- rep(0, length(vol))
for (i in 1 : length(vol)) {
  fspec <- getspec(ARMA_GARCH.models[[i]])
  setfixed(fspec) <- as.list(coef(ARMA_GARCH.models[[i]]))
  future.path <- fitted(ugarchpath(fspec, n.sim = 30, m.sim = 1000))
  # Due to numerical issues, sometimes NA value can be produced 
  # We simply replace NA value with 0; you may come up with a better idea in your own project
  future.path[is.na(future.path)] <- 0 
  RV.pred[i] <- mean(sqrt(colSums(future.path ^ 2)))
}

3 Assessing the model

We will compute two performance measures, QLIKE and MSE, from the prediction results under the one-shot estimation scheme for prediction. For illustration, this lab only computes the prediction performance measures for the first method in Section 2, i.e., the regression method. Remember we have stored all the fitted regression models in a list lm.models in Section 2.2.
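With realised volatilities \(\text{RV}_i\) and predictions \(\widehat{\text{RV}}_i\) over \(n\) validation points, the two measures are

\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \text{RV}_i - \widehat{\text{RV}}_i \right)^2, \qquad
\text{QLIKE} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{\text{RV}_i}{\widehat{\text{RV}}_i} - \log \frac{\text{RV}_i}{\widehat{\text{RV}}_i} - 1 \right).
\]

Lower values are better for both; QLIKE penalises under-prediction of volatility more heavily than over-prediction.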

3.1 Prediction based on regression

[a] Use the fitted linear regression models to predict the realised volatility in the validation data set.

Suggestions

list.reg.val <- list()
len.val <- length(vol.val[[1]]$volatility)
pred.lm <- list()

for (i in 1 : length(vol)) {
  stats.bucket <- stock1 %>% 
    filter(time_id == time_IDs[i] & time_bucket != 0) %>% 
    select(c(BidAskSpread, WAP, num_order, time_bucket))
  mean.price <- aggregate(WAP ~ time_bucket, data = stats.bucket, FUN = mean)
  mean.order <- aggregate(num_order ~ time_bucket, data = stats.bucket, FUN = mean)
  mean.BAS <- aggregate(BidAskSpread ~ time_bucket, data = stats.bucket, FUN = mean)
  list.reg.val[[i]] <- 
    data.frame(volatility = vol.val[[i]]$volatility, 
               price = mean.price$WAP[len.train:(len.train + len.val - 1)],
               order = mean.order$num_order[len.train:(len.train + len.val - 1)],
               BidAskSpread = mean.BAS$BidAskSpread[len.train:(len.train + len.val - 1)])
  pred.lm[[i]] <- predict(lm.models[[i]], newdata = list.reg.val[[i]])
}

3.2 Prediction accuracy

[a] Compute the QLIKE and MSE for all data points in the validation data set. Aggregate the results and plot them as box plots.

Suggestions

MSE.lm <- vector()
QLIKE.lm <- vector()
for (i in 1 : length(vol)) {
  MSE.lm <- c(MSE.lm, mean((vol.val[[i]]$volatility - pred.lm[[i]]) ^ 2))
  QLIKE.lm <- c(QLIKE.lm, mean(vol.val[[i]]$volatility / pred.lm[[i]] - 
                                 log(vol.val[[i]]$volatility / pred.lm[[i]]) - 1))
}

boxplot(MSE.lm, horizontal = TRUE)

boxplot(QLIKE.lm, horizontal = TRUE)